The robust tagging of unrestricted text: the BNC experience

Author

  • ROGER GARSIDE
Abstract

The production of annotated machine-readable corpora has been a central activity of the UCREL team at Lancaster University, led by Geoffrey Leech, since the early 1980s. This commenced with the annotation of the LOB (Lancaster-Oslo/Bergen) corpus with part-of-speech information over the period 1981-84 (Garside, Leech and Sampson 1987). The work has continued with corpora which introduced syntactic annotation at the constituent level (Leech and Garside 1991) and at the level of anaphora (Fligelstone 1992; Garside 1993), with corpora marked with word sense and key semantic relationship information (Wilson and Rayson 1993), and with aligned multi-lingual corpora (McEnery et al. 1994). Most recently it has resulted in the development of the British National Corpus (BNC), a corpus of one hundred million words of varied written and spoken texts annotated with part-of-speech information (Leech 1993). The BNC was constructed by a team of publishers (Oxford University Press, Longman, and Chambers Harrap), academic institutions (Lancaster University and Oxford University Computing Services) and the British Library, over the period 1991-94; Lancaster was responsible for the grammatical tagging of the corpus. The various types of annotation listed above have always been inserted in the text by a mixture of automatic and manual procedures, generally making use of some form of probabilistic technique to assign an analysis to a text at the appropriate linguistic level; this analysis can then be manually post-edited if necessary to achieve the desired level of accuracy. Because large quantities of text are involved, the post-editing will always involve some assistance by the computer to minimize the time taken to implement the analyst's decision, and to check the validity of each change. The results of the manual correction ...


Similar resources

A domain-independent semantic tagger for the study of meaning associations in English text

A comparison of semantic tagging with syntactic Part-of-Speech tagging leads us to propose that a domain-independent semantic tagger for English corpora should not aim to annotate each word with an atomic ‘sem-tag’, but instead that a semantic tagging should attach to each word a set of semantic primitive attributes or features. These features should include: − lemma or root, grouping together ...


CLAWS4: The Tagging of the British National Corpus

The main purpose of this paper is to describe the CLAWS4 general-purpose grammatical tagger, used for the tagging of the 100-million-word British National Corpus, of which c.70 million words have been tagged at the time of writing (April 1994). We will emphasise the goals of (a) general-purpose adaptability, (b) incorporation of linguistic knowledge to improve quality and consistency, and (c) ...


From Part of Speech Tagging to Memory-based Deep Syntactic Analysis

This paper presents a robust system for deep syntactic parsing of unrestricted French. This system uses techniques from Part-of-Speech tagging in order to build a constituent structure and uses other techniques from dependency grammar in an original framework of memories in order to build a functional structure. The two structures are built simultaneously by two interacting processes. The proce...


A part-of-speech tagging system for the Persian language

Abstract: Part-of-speech (POS) tagging is an essential task for many models and methods in other areas of natural language processing, such as machine translation, spell checking, text-to-speech, and automatic speech recognition. So far, highly accurate POS taggers have been created for many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...


The BNC Parsed with RASP4UIMA

We have integrated the RASP system with the UIMA framework (RASP4UIMA) and used this to parse the XML-encoded version of the British National Corpus (BNC). All original annotation is preserved, and parsing information, mainly in the form of grammatical relations, is added in an XML format. A few specific adaptations of the system to give better results with the BNC are discussed briefly. The RA...




Publication date: 2004